[WIP] Add benchmark scripts to different moe gemms #1315

fzyzcjy · 2025-07-24T05:41:36Z

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

gemini-code-assist

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities for Mixture-of-Experts (MoE) General Matrix Multiplications (GEMMs) by introducing a new, more granular profiling utility. It integrates this utility into existing benchmarks to provide detailed kernel-level performance metrics, expands the range of tested configurations, and standardizes the output format for easier data processing. Additionally, it adjusts project dependencies.

Highlights

Enhanced Benchmarking Utility: Introduced a new bench_kineto utility function in flashinfer/testing/utils.py for detailed CUDA kernel profiling. This utility leverages PyTorch's Kineto profiler to measure specific kernel execution times, offering more granular performance insights than general execution time measurements.
Integration of Kineto Profiling: The bench_cutlass_fused_moe.py script has been updated to utilize the new bench_kineto utility. This allows for precise measurement of gemm1 and gemm2 kernel durations within the fused MoE operation, providing deeper analysis of performance bottlenecks.
Expanded Benchmark Configurations: The bench_cutlass_fused_moe.py script now includes additional BATCH_SIZES (384, 768) and a more dynamic generation of test_configs for num_experts, covering a wider range of expert counts (e.g., 288 // 1 down to 288 // 32). This broadens the scope of the benchmarks.
Structured Benchmark Output: The benchmark results in bench_cutlass_fused_moe.py are now printed as a JSON string, including batch_size, num_experts, top_k, intermediate_size, and the measured time_gemm1_us and time_gemm2_us. This provides machine-readable and easily parsable output for automated analysis.
Dependency Adjustment: The nvidia-cudnn-frontend dependency has been commented out in setup.py, indicating it may no longer be a required installation for the project.
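For reference, the expert-count sweep mentioned in the highlights ("288 // 1 down to 288 // 32") could be generated along these lines. This is only a sketch: the power-of-two divisor list and the names TOTAL_EXPERTS, DIVISORS, and expert_counts are assumptions for illustration, not code taken from the PR.

```python
# Hypothetical names; the summary only states the endpoints 288 // 1 and 288 // 32.
TOTAL_EXPERTS = 288
DIVISORS = (1, 2, 4, 8, 16, 32)  # assumed power-of-two steps between the endpoints


def expert_counts(total=TOTAL_EXPERTS, divisors=DIVISORS):
    """Return the per-configuration expert counts, largest first."""
    return [total // d for d in divisors]


if __name__ == "__main__":
    print(expert_counts())
```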

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces benchmark scripts for MoE GEMMs and adds a new benchmarking utility. The changes are a good step towards performance analysis.

My review focuses on improving code quality, maintainability, and robustness. I've identified a potential resource leak in the new suppress_stdout_stderr utility, which is important to fix. I've also pointed out several opportunities to clean up the code by removing unused imports, temporary comments, and commented-out code blocks, which will improve the overall maintainability of the new benchmark scripts.

gemini-code-assist · 2025-07-24T05:44:02Z

flashinfer/testing/utils.py

+class suppress_stdout_stderr:
+    def __enter__(self):
+        self.outnull_file = open(os.devnull, 'w')
+        self.errnull_file = open(os.devnull, 'w')
+
+        self.old_stdout_fileno_undup = sys.stdout.fileno()
+        self.old_stderr_fileno_undup = sys.stderr.fileno()
+
+        self.old_stdout_fileno = os.dup(sys.stdout.fileno())
+        self.old_stderr_fileno = os.dup(sys.stderr.fileno())
+
+        self.old_stdout = sys.stdout
+        self.old_stderr = sys.stderr
+
+        os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
+        os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
+
+        sys.stdout = self.outnull_file
+        sys.stderr = self.errnull_file
+        return self
+
+    def __exit__(self, *_):
+        sys.stdout = self.old_stdout
+        sys.stderr = self.old_stderr
+
+        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
+        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
+
+        os.close(self.old_stdout_fileno)
+        os.close(self.old_stderr_fileno)
+
+        self.outnull_file.close()
+        self.errnull_file.close()


The current implementation of suppress_stdout_stderr is not exception-safe, which can lead to resource leaks. Use a try...finally block or contextlib.ExitStack to ensure resources are always cleaned up correctly, even in case of errors.

Suggested change

-class suppress_stdout_stderr:
-    def __enter__(self):
-        self.outnull_file = open(os.devnull, 'w')
-        self.errnull_file = open(os.devnull, 'w')
-
-        self.old_stdout_fileno_undup = sys.stdout.fileno()
-        self.old_stderr_fileno_undup = sys.stderr.fileno()
-
-        self.old_stdout_fileno = os.dup(sys.stdout.fileno())
-        self.old_stderr_fileno = os.dup(sys.stderr.fileno())
-
-        self.old_stdout = sys.stdout
-        self.old_stderr = sys.stderr
-
-        os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
-        os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
-
-        sys.stdout = self.outnull_file
-        sys.stderr = self.errnull_file
-        return self
-
-    def __exit__(self, *_):
-        sys.stdout = self.old_stdout
-        sys.stderr = self.old_stderr
-
-        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
-        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
-
-        os.close(self.old_stdout_fileno)
-        os.close(self.old_stderr_fileno)
-
-        self.outnull_file.close()
-        self.errnull_file.close()
+    def __exit__(self, *_):
+        sys.stdout = self.old_stdout
+        sys.stderr = self.old_stderr
+
+        os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
+        os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
+
+        os.close(self.old_stdout_fileno)
+        os.close(self.old_stderr_fileno)
+
+        self.outnull_file.close()
+        self.errnull_file.close()
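Note that `__exit__` already runs when the body of the with-block raises; the remaining risk is a failure partway through `__enter__` or `__exit__`, which would leave some descriptors open or not restored. One possible shape for the contextlib.ExitStack variant the comment mentions is sketched below. It is not the PR's code: it is written as a generator-based context manager, and it assumes that native output goes to file descriptors 1 and 2 rather than looking them up through sys.stdout.fileno().

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_stdout_stderr():
    """Send fds 1 and 2, plus sys.stdout and sys.stderr, to os.devnull."""
    with contextlib.ExitStack() as stack:
        devnull = stack.enter_context(open(os.devnull, "w"))
        # Assumption: fds 1 and 2 are the process-level stdout and stderr.
        for stream, fd in ((sys.stdout, 1), (sys.stderr, 2)):
            stream.flush()  # emit buffered text before the descriptor changes
            saved_fd = os.dup(fd)
            # ExitStack unwinds in reverse order: restore first, then close the copy.
            stack.callback(os.close, saved_fd)
            stack.callback(os.dup2, saved_fd, fd)
            os.dup2(devnull.fileno(), fd)
        # Python-level writes go to the devnull file object until the block exits.
        stack.enter_context(contextlib.redirect_stdout(devnull))
        stack.enter_context(contextlib.redirect_stderr(devnull))
        yield
```

Because every acquired resource is registered on the stack as soon as it exists, an error at any later step still unwinds the earlier ones.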

gemini-code-assist · 2025-07-24T05:44:03Z

benchmarks/bench_cutlass_fused_moe.py

+import os
+import sys
+import time


Consider removing the unused sys and time imports to maintain clean and readable code.

Suggested change

-import os
-import sys
-import time
+import json
+import os
+import torch

gemini-code-assist · 2025-07-24T05:44:03Z

benchmarks/bench_cutlass_fused_moe.py

@@ -35,7 +40,9 @@
    96,
    128,
    256,
+    384, # NOTE ADD


Remove this temporary comment, along with the blocks of commented-out code (e.g., lines 84-97, 215-227), to improve code clarity.

Suggested change

-    384, # NOTE ADD
+    384,

gemini-code-assist · 2025-07-24T05:44:03Z

benchmarks/bench_cutlass_fused_moe.py

+    print(f"MAIN_OUTPUT=" + json.dumps(dict(
+        batch_size=batch_size,
+        num_experts=num_experts,
+        top_k=top_k,
+        intermediate_size=intermediate_size,
+        time_gemm1_us=time_gemm1 * 1e6,
+        time_gemm2_us=time_gemm2 * 1e6,
+    )))


Use a single f-string for the whole line to improve readability.

print(f"MAIN_OUTPUT={{\"batch_size\":{batch_size},\"num_experts\":{num_experts},\"top_k\":{top_k},\"intermediate_size\":{intermediate_size},\"time_gemm1_us\":{time_gemm1 * 1e6},\"time_gemm2_us\":{time_gemm2 * 1e6}}}")
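Hand-building JSON inside an f-string requires doubling every literal brace and does not escape values, so it is easy to get wrong. An alternative sketch keeps json.dumps and still uses a single f-string, and shows how a consumer might pull the records back out of a log. The helper names format_main_output and parse_main_output are hypothetical and not part of the PR; the field names follow the diff above.

```python
import json

PREFIX = "MAIN_OUTPUT="


def format_main_output(batch_size, num_experts, top_k, intermediate_size,
                       time_gemm1, time_gemm2):
    """Build one result line; times are given in seconds and reported in microseconds."""
    record = dict(
        batch_size=batch_size,
        num_experts=num_experts,
        top_k=top_k,
        intermediate_size=intermediate_size,
        time_gemm1_us=time_gemm1 * 1e6,
        time_gemm2_us=time_gemm2 * 1e6,
    )
    return f"{PREFIX}{json.dumps(record)}"


def parse_main_output(log_text):
    """Return the decoded records from every line that starts with the prefix."""
    return [
        json.loads(line[len(PREFIX):])
        for line in log_text.splitlines()
        if line.startswith(PREFIX)
    ]
```

A script would then call print(format_main_output(...)) once per configuration, and any downstream tool can filter the log on the MAIN_OUTPUT= prefix.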

gemini-code-assist · 2025-07-24T05:44:03Z

flashinfer/testing/utils.py

+    prof_lines = profiler.key_averages().table(sort_by='cuda_time_total', max_name_column_width=100).split('\n')
+    # print(f"prof_lines=\n" + "\n".join(prof_lines))


Remove the commented-out print statement to improve clarity.

gemini-code-assist · 2025-07-24T05:44:03Z

setup.py

+    # NOTE MODIFIED rm
+    # "nvidia-cudnn-frontend",


Remove this temporary comment and the commented-out line.

fzyzcjy added 17 commits July 24, 2025 10:08

more a5390df
more 94c6d27
more 64b0032
more cfcc7ce
more 0e63dbe
more 72ec2c4
more 8caafde
more 5cbc7d0
more ab252c7
more 8629008
more 4624913
more 9de0387
more c89cf7c
more adf2cb1
more 57eda4a
more 8b606b8
more 7b78322

gemini-code-assist bot reviewed Jul 24, 2025


fzyzcjy added 11 commits July 24, 2025 13:44

more 8c53c59
more c4c8a07
more a19273c
more a129862
more 3d37611
more 62eaa46
more dbecb4a
more 91863e4
more 3baee7e
more 57a18b6
more 806cf11
